## Warning: package 'ggplot2' was built under R version 3.5.2
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'tidyr' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.2
## Warning: package 'stringr' was built under R version 3.5.2
## Warning: package 'DT' was built under R version 3.5.2
## Warning: package 'knitr' was built under R version 3.5.2
## Warning: package 'tidyquant' was built under R version 3.5.2
## Warning: package 'PerformanceAnalytics' was built under R version 3.5.2
## Warning: package 'zoo' was built under R version 3.5.2
## Warning: package 'quantmod' was built under R version 3.5.2
## Warning: package 'TTR' was built under R version 3.5.2
## Warning: package 'cowplot' was built under R version 3.5.2
The adverse affects of air pollution on health are well documented and air pollution can lead to a large range of diseases and increased morbidity and mortality (Younger et al., 2008). Adverse health impacts include, but are not limited to, lung cancer risk, respiritory infections, allergic disease and asthma (Younger et al., 2008; Shea et al., 2008). These health risks can affect a large proportion of the population as many different groups are vulnerable to the effects of air pollution including infants, children, the elderly, people with impaired immune systems, and people who work or are physically active outdoors (Matooane et al., 2004).
Because of the many, and severe, impacts of air quality, it is important to understand patterns in the data. We have a dataset of air quality observations as well as temperature and humidity data which we will use to gain understanding of the patterns and impacts of weather on air quality.
The air quality dataset used in this analysis was obtained from the University of California Irvine Machine learning Repository. It was contributed by Saverio De Vito from the National Agency for New Technologies, Energy and Sustainable Economic Development.
The dataset contains 15 variables and 9358 observations of hourly averaged responses from an Air Quality Chemical Multisensor Device. Data were recorded from March 2004 to February 2005, in a significantly polluted area, at road level, within a city in Italy. Variables include the date and time each response was recorded, and the corresponding concentrations of 13 air pollutants analyzed by the sensor device. Missing values are tagged with -200 value. Below is the entire variable set:
| Variables | Type | Description |
|---|---|---|
| Date | character | Date (DD/MM/YYYY) |
| Time | time | Time (HH.MM.SS) |
| CO(GT) | double | True hourly averaged concentration CO in mg/m^3 (reference analyzer) |
| PT08.S1(CO) | integer | PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted) |
| NMHC(GT) | integer | True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m^3 (reference analyzer) |
| C6H6(GT) | double | True hourly averaged Benzene concentration in microg/m^3 (reference analyzer) |
| PT08.S2(NMHC) | integer | PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted) |
| NOx(GT) | integer | True hourly averaged NOx concentration in ppb (reference analyzer) |
| PT08.S3(NOx) | integer | PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted) |
| NO2(GT) | integer | True hourly averaged NO2 concentration in microg/m^3 (reference analyzer) |
| PT08.S4(NO2) | integer | PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted) |
| PT08.S5(O3) | integer | PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted) |
| T | double | Temperature in °C |
| RH | double | Relative Humidity (%) |
| AH | double | AH Absolute Humidity |
The dataset is shown below:
# first we read the data in
airq <- readr::read_csv(here::here("Data","airquality.csv"))
## Parsed with column specification:
## cols(
## Date = col_date(format = ""),
## Time = col_time(format = ""),
## `CO(GT)` = col_double(),
## `PT08.S1(CO)` = col_integer(),
## `NMHC(GT)` = col_integer(),
## `C6H6(GT)` = col_double(),
## `PT08.S2(NMHC)` = col_integer(),
## `NOx(GT)` = col_integer(),
## `PT08.S3(NOx)` = col_integer(),
## `NO2(GT)` = col_integer(),
## `PT08.S4(NO2)` = col_integer(),
## `PT08.S5(O3)` = col_integer(),
## T = col_double(),
## RH = col_double(),
## AH = col_double()
## )
DT::datatable(airq)
The following shows the five-number stats summary for each variable:
# Five-number summary for each variable
summary(airq)
## Date Time CO(GT) PT08.S1(CO)
## Min. :2004-03-10 Length:9357 Min. :-200.00 Min. :-200
## 1st Qu.:2004-06-16 Class1:hms 1st Qu.: 0.60 1st Qu.: 921
## Median :2004-09-21 Class2:difftime Median : 1.50 Median :1053
## Mean :2004-09-21 Mode :numeric Mean : -34.21 Mean :1049
## 3rd Qu.:2004-12-28 3rd Qu.: 2.60 3rd Qu.:1221
## Max. :2005-04-04 Max. : 11.90 Max. :2040
## NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT)
## Min. :-200.0 Min. :-200.000 Min. :-200.0 Min. :-200.0
## 1st Qu.:-200.0 1st Qu.: 4.000 1st Qu.: 711.0 1st Qu.: 50.0
## Median :-200.0 Median : 7.900 Median : 895.0 Median : 141.0
## Mean :-159.1 Mean : 1.866 Mean : 894.6 Mean : 168.6
## 3rd Qu.:-200.0 3rd Qu.: 13.600 3rd Qu.:1105.0 3rd Qu.: 284.0
## Max. :1189.0 Max. : 63.700 Max. :2214.0 Max. :1479.0
## PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3)
## Min. :-200 Min. :-200.00 Min. :-200 Min. :-200.0
## 1st Qu.: 637 1st Qu.: 53.00 1st Qu.:1185 1st Qu.: 700.0
## Median : 794 Median : 96.00 Median :1446 Median : 942.0
## Mean : 795 Mean : 58.15 Mean :1391 Mean : 975.1
## 3rd Qu.: 960 3rd Qu.: 133.00 3rd Qu.:1662 3rd Qu.:1255.0
## Max. :2683 Max. : 340.00 Max. :2775 Max. :2523.0
## T RH AH
## Min. :-200.000 Min. :-200.00 Min. :-200.0000
## 1st Qu.: 10.900 1st Qu.: 34.10 1st Qu.: 0.6923
## Median : 17.200 Median : 48.60 Median : 0.9768
## Mean : 9.778 Mean : 39.49 Mean : -6.8376
## 3rd Qu.: 24.100 3rd Qu.: 61.90 3rd Qu.: 1.2962
## Max. : 44.600 Max. : 88.70 Max. : 2.2310
The following shows some preliminary info on the air quality dataset that we are using. We record the number of total observations, number of missing observations, percentage of missing values and the number of usable observations.
# Look at missing values for each variable
missing = list()
for(i in 1:15) {
l = length(which(airq[i] == -200))
missing[[i]] = l
}
obs = list()
for(i in 1:15) {
o = length(airq[[i]])
obs[[i]] = o
}
dfmissing = data.frame(Variables,
matrix(unlist(missing), nrow=length(missing), byrow=T),
matrix(unlist(obs), nrow=length(missing), byrow=T))
names(dfmissing)[names(dfmissing) == "matrix.unlist.missing...nrow...length.missing...byrow...T."] = "Count of Missing Values"
names(dfmissing)[names(dfmissing) == "matrix.unlist.obs...nrow...length.missing...byrow...T."] = "Total Observations"
dfmissing %>%
mutate(`% Missing Values` = `Count of Missing Values`/`Total Observations`*100) %>%
mutate(`Usable Observations` = `Total Observations` - `Count of Missing Values`)
From this we see that for many of the observations less than 4% of the data is missing. This is adequate for the research we are conducting.
#convert Missing Values (tagged with -200 value) to NA
airq[airq == -200] = NA
# Convert numeric columns to 'double' type
airq[3:15] <- sapply(airq[3:15], as.double)
airq_corr <- cor(airq[3:15], use = "complete.obs")
# Round the values to 2 decimal places
airq_corr <- round(airq_corr,2)
corrplot(airq_corr,
type="upper",
method="color",
tl.srt=45,
tl.col = "blue",
diag = FALSE)
Looking at the correlations of the pollutants with weather, we can see that for all pollutants except NOx, temperature (T) is positively correlated, although weakly so. This means that higher temperatures correspond to higher concentrations of the gases. Relative humidity (RH) is negatively and correlated to temperature and has a weak negative correlation to the concentrations of pollutants, except NOx. Absolute humidity (AH) has stronger correlations, mostly positive, although, like temperature, it has a negative correlation with NOx.
The following plot shows the hourly concentrations of some of the pollutants (tin oxide, benzene, and Titania) for a month. Also plotted are temperature and relative humidity.
#create a time stamp with both date and time (might want to move this to earlier section?)
airq = airq %>%
mutate(Date_Time = ymd_hms(paste(airq$Date, airq$Time)))
#data preparation: short to long, select pollutants to be included
airq.long = airq %>%
#select(Date_Time, `PT08.S1(CO)`, `C6H6(GT)`, `PT08.S2(NMHC)`,`PT08.S3(NOx)`, `PT08.S4(NO2)`, `PT08.S5(O3)`) %>%
select(Date_Time, `PT08.S1(CO)`,`C6H6(GT)`, `PT08.S2(NMHC)`) %>%
gather(key = "Variable", value = "Value", -Date_Time)
#head(airq.long)
weather.long = airq %>%
select(Date_Time,`T`, `RH`) %>%
gather(key = "Variable", value = "Value", -Date_Time)
#Graph of pollutants' concentration by hours
plot_1 <- airq.long %>%
drop_na(Value) %>%
ggplot(aes(x = Date_Time, y = Value)) +
geom_line(aes(color = Variable, linetype = Variable)) +
theme_bw() +
coord_x_datetime(xlim = c("2005-03-24 01:00:00", "2005-04-04 23:00:00")) +
xlab("Time") +
ylab("microg/m^3")
plot_2 <- airq %>%
ggplot(aes(x = Date_Time)) +
geom_line(aes(y=T, colour = "Temperature")) +
theme_bw() +
coord_x_datetime(xlim = c("2005-03-24 01:00:00", "2005-04-04 23:00:00")) +
xlab("Time") +
ylab("Temperature (Degrees C)") +
geom_line(aes(y=RH, colour = "Humidity")) +
scale_y_continuous(sec.axis = sec_axis(~., name = "Relative Humidity (%)"))
plot_grid(plot_1, plot_2, ncol=1)
The plots below now show the daily averaged concentrations of some of the pollutants (tin oxide, benzene, and Titania) for a year. Also plotted are daily temperature and relative humidity.
#Aggregate Daily Average
airq_daily = airq %>%
group_by(Date) %>%
summarise_all(funs(mean), na.rm = TRUE)
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
#data preparation: short to long, select pollutants to be included
airq.lg.d = airq_daily %>%
#select(Date, `PT08.S1(CO)`, `C6H6(GT)`, `PT08.S2(NMHC)`,`PT08.S3(NOx)`, `PT08.S4(NO2)`, `PT08.S5(O3)`) %>%
select(Date, `PT08.S1(CO)`,`C6H6(GT)`, `PT08.S2(NMHC)`) %>% #, `T`, AH) %>%
gather(key = "Variable", value = "Value", -Date)
#head(airq.lg.d)
weather.dly.long = airq_daily %>%
select(Date,`T`, `RH`) %>%
gather(key = "Variable", value = "Value", -Date)
#Graph of pollutants' concentration by day
plot_3 <- airq.lg.d %>%
drop_na(Value) %>%
ggplot(aes(x = Date, y = Value)) +
geom_line(aes(color = Variable, linetype = Variable)) +
theme_bw() +
coord_x_date(xlim = c("2005-01-04", "2005-04-04")) +
xlab("Time") +
ylab("microg/m^3")
plot_4 <- airq_daily %>%
ggplot(aes(x = Date)) +
geom_line(aes(y=T, colour = "Temperature")) +
theme_bw() +
#coord_x_datetime(xlim = c("2005-01-04", "2005-04-04")) +
xlab("Time") +
ylab("Temperature (Degrees C)") +
geom_line(aes(y=RH, colour = "Humidity")) +
scale_y_continuous(sec.axis = sec_axis(~., name = "Relative Humidity (%)"))
plot_grid(plot_3, plot_4, ncol=1)
We now show the weekly averaged concentrations of some of the pollutants (tin oxide, benzene, and Titania) for a year. Also plotted are daily temperature and relative humidity.
#Aggregate Weekly Average
airq_weekly = airq %>%
mutate(Week = floor_date(Date_Time, unit = "week")) %>%
group_by(Week) %>%
summarise_all(funs(mean), na.rm = TRUE)
#data preparation: short to long, select pollutants to be included
airq.lg.w = airq_weekly %>%
#select(Week, `PT08.S1(CO)`, `C6H6(GT)`, `PT08.S2(NMHC)`,`PT08.S3(NOx)`, `PT08.S4(NO2)`, `PT08.S5(O3)`) %>%
select(Week, `PT08.S1(CO)`,`C6H6(GT)`, `PT08.S2(NMHC)`) %>% #, `T`, AH) %>%
gather(key = "Variable", value = "Value", -Week)
#head(airq.lg.w)
#Graph of pollutants' concentration by week
plot_5 <- airq.lg.w %>%
drop_na(Value) %>%
ggplot(aes(x = Week, y = Value)) +
geom_line(aes(color = Variable, linetype = Variable)) +
theme_bw() +
#coord_x_date(xlim = c("2005-01-04", "2005-04-04")) +
xlab("Time") +
ylab("microg/m^3")
plot_6 <- airq_weekly %>%
ggplot(aes(x = Week)) +
geom_line(aes(y=T, colour = "Temperature")) +
theme_bw() +
#coord_x_datetime(xlim = c("2005-01-04", "2005-04-04")) +
xlab("Time") +
ylab("Temperature (Degrees C)") +
geom_line(aes(y=RH, colour = "Humidity")) +
scale_y_continuous(sec.axis = sec_axis(~., name = "Relative Humidity (%)"))
plot_grid(plot_5, plot_6, ncol=1)
We now show the monthly averaged concentrations of some of the pollutants (tin oxide, benzene, and Titania) for a year. Also plotted are daily temperature and relative humidity.
#Aggregate Daily Average
airq_monthly = airq %>%
mutate(Month = floor_date(Date_Time, unit = "month")) %>%
group_by(Month) %>%
summarise_all(funs(mean), na.rm = TRUE)
#data preparation: short to long, select pollutants to be included
#Graph of pollutants' concentration by month
plot_7 <- airq_monthly %>%
#select(Month, `PT08.S1(CO)`, `C6H6(GT)`, `PT08.S2(NMHC)`,`PT08.S3(NOx)`, `PT08.S4(NO2)`, `PT08.S5(O3)`) %>%
select(Month, `PT08.S1(CO)`,`C6H6(GT)`, `PT08.S2(NMHC)`) %>% #, `T`, AH) %>%
gather(key = "Variable", value = "Value", -Month) %>%
drop_na(Value) %>%
ggplot(aes(x = Month, y = Value)) +
geom_line(aes(color = Variable, linetype = Variable)) +
theme_bw() +
#coord_x_date(xlim = c("2005-01-04", "2005-04-04")) +
xlab("Time") +
ylab("microg/m^3")
#Graph of temperature and humidity by month
plot_8 <- airq_monthly %>%
ggplot(aes(x = Month)) +
geom_line(aes(y=T, colour = "Temperature")) +
theme_bw() +
#coord_x_datetime(xlim = c("2005-01-04", "2005-04-04")) +
xlab("Time") +
ylab("Temperature (Degrees C)") +
geom_line(aes(y=RH, colour = "Humidity")) +
scale_y_continuous(sec.axis = sec_axis(~., name = "Relative Humidity (%)"))
plot_grid(plot_7, plot_8, ncol=1)
#airq
newnames <- c("Date", "Time", "CO", "Tin_oxide", "Hydro_carbons", "Benzene", "Titania", "NOx", "Tungsten_oxide_NOx", "NO2", "Tungsten_oxide_NO2", "Indium_oxide", "Temp", "RH", "AH")
for (i in 1:ncol(airq)){
print(i)
names(airq)[i]=newnames[i]
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
airq
# rename_f = function(col, newname){
# names(airq)[col]=newname
# }
#new <- map(newnames, ~rename_f(cols,.x))
#new
airq %>%
ggplot() +
geom_line(aes(y=Benzene, x=Temp)) +
theme_bw() +
xlab("Temperature (Degrees C") +
ylab("Benzene concentration (microg/m^3")
## Warning: Removed 366 row(s) containing missing values (geom_path).
In this analysis, we will attempt to determine the effects of temperature and humidity on the concentration of air pollutants.
With our research question, we are interested in the hourly averaged concentrations of air pollutants, temperature and humidity. We will ignore variables which have too many missing data to increase the precision of this analysis. After dealing with the missing data, we will perform a linear regression analysis using OLS (ordinary least square) method. Coefficients of relevant variables will be plotted with confidence intervals.
S. De Vito, E. Massera, M. Piga, L. Martinotto, G. Di Francia, On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario, Sensors and Actuators B: Chemical, Volume 129, Issue 2, 22 February 2008, Pages 750-757, ISSN 0925-4005.
Matooane, M., John, J., Oosthuizen, R., and Binedell, M. 2004. Vulnerability of South African communities to air pollution. In: 8th World Congress on Environmental Health. Durban, South Africa: Document Transformation Technologies.
Shea, K., Truckner, R., Weber, R., and Peden, D. 2008. Climate change and allergic disease. Journal of Allergy and Clinical Immunology, 122(3): 443-453.
Younger, M., Morrow-Almeida, H., Vindigni, S., and Dannenberg, A. 2008. The Built Environment, Climate Change, and Health Opportunities for Co-Benefits. American Journal of Preventative Medicine, 35 (5): 517-526.